1 EXECUTIVE SUMMARY

Prediction accuracy of 88 % has been reached on the validation set, against 50 % with the baseline model. The data is a sample of Amazon reviews provided by the UCI Machine Learning Repository.

In this sentiment analysis project, which factors have contributed to that 38-percentage-point improvement?

Natural Language Processing has contributed 21.7 percentage points: corpus building, lowercasing, punctuation handling, stopword removal, stemming, tokenization of sentences into words, and a bag of words.

Text mining has added a further 12.7 percentage points. The following insights have been decisive.

Some tokens conveying subjective information predominate in decision trees; other tokens containing subjective information, however, have not been used and turn up in false negatives and false positives. That ignored subjective information has been retrieved from random samples of false negatives and false positives, exclusively on the training set. Customized lists have been established, sorting tokens as having either a positive or a negative sentiment orientation, and occurrences of these tokens in reviews have been replaced with either a positive or a negative generic token. This polarization and text substitution have brought 10.3 of the 12.7 percentage points.

Another insight concerns the impact of negation: negation has been fruitfully integrated, contributing 2.4 of the 12.7 percentage points from text mining.

Machine learning optimization has been performed across 10 models. Testing has been conducted on accuracy distributions across bootstrapped resamples. eXtreme Gradient Boosting has emerged as the best-performing model in this project and has added a further 3.6 percentage points of accuracy.


TAGS: sentiment analysis, natural language processing, text mining, subjective information, tokenization, bag of words, word frequency, interactive wordclouds, graphs, and tables, decision trees, false negatives, false positives, text classification, polarization, lists of positive n-grams, lists of negative n-grams, text substitution, machine learning, binary classification, eXtreme Gradient Boosting, Monotone Multi-Layer Perceptron Neural Network, Random Forest, Stochastic Gradient Boosting, Support Vector Machines with Radial Basis Function Kernel, AdaBoost Classification Trees, bootstrapping, accuracy distributions across resamples, R


GITHUB: https://github.com/Dev-P-L/Sentiment-Analysis


1.1 II. FOREWORD to READERS

Dear Readers, you are most welcome to run the project on your own computer if you so wish.

This project is lodged with the GitHub repository https://github.com/Dev-P-L/Sentiment-Analysis.

It comprises twelve files. All code is included in SA_Amazon_Code.Rmd; none of it shows in the results report, called SA_Amazon_Insights&Results.html.

For your convenience, the dataset has already been uploaded to the GitHub repository, from which it will be automatically retrieved by the code in SA_Amazon_Code.Rmd. If you so wish, you can also easily retrieve the dataset from https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences and adapt the SA_Amazon_Code.Rmd code accordingly.

You can knit SA_Amazon_Code.Rmd (to HTML, please) and produce SA_Amazon_Insights&Results.html on your own computer. On my laptop, running SA_Amazon_Code.Rmd takes approximately four hours. For information, here are some characteristics of my work environment:

  • R Version 4.0.3 (2020-10-10) – “Bunny-Wunnies Freak Out”,
  • RStudio Version 1.3.1093 (2020-09-17) – “Apricot Nasturtium”,
  • Windows 10.

Some packages are required by SA_Amazon_Code.Rmd. The code in SA_Amazon_Code.Rmd contains instructions to install these packages if they are not yet available.
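For illustration, the install-if-missing idiom such a setup typically relies on can be sketched as follows; the helper name and the package list in the comment are assumptions, not the repository's exact code:

```r
# Hypothetical helper: install a package only when it is missing, then attach it.
ensure_packages <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg, repos = "https://cloud.r-project.org")
    }
    library(pkg, character.only = TRUE)
  }
}
# e.g. ensure_packages(c("tm", "caret"))  # package names assumed; see SA_Amazon_Code.Rmd
```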

Now, let’s get in touch with data.


1.2 III. GETTING IN TOUCH with DATA

As explained on the UCI Machine Learning Repository website, the data is organized in a CSV file with two columns. The first column contains 1,000 Amazon product reviews (sentences). The second column contains a positive or negative evaluation; the ratio of positive evaluations is 50 %.

That file will be split into training reviews - two thirds of reviews - and validation reviews. Let’s have a quick look at the number of positive and negative reviews in the training set.


Review Polarity Number of Reviews in Training Set
Pos 334
Neg 334
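The stratified split and the counts above can be sketched in base R as follows; the seed, the variable names and the toy review texts are assumptions, not the repository's exact code:

```r
set.seed(1)  # assumed seed; the repository may use a different one
# toy stand-in for the 1,000 labelled Amazon reviews (500 positive, 500 negative)
reviews <- data.frame(text = sprintf("review %d", 1:1000),
                      polarity = rep(c("Pos", "Neg"), each = 500),
                      stringsAsFactors = FALSE)
# stratified two-thirds split: 334 reviews per class, i.e. 668 training reviews
idx <- c(sample(which(reviews$polarity == "Pos"), 334),
         sample(which(reviews$polarity == "Neg"), 334))
train_set <- reviews[idx, ]
valid_set <- reviews[-idx, ]
```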


Let’s have a look at a few training reviews.



Let’s proceed to some NLP.


1.3 NATURAL LANGUAGE PROCESSING

Let’s process reviews to …


1.3.1 A. Corpus - Tokenization - Bag of Words

Training reviews will be transposed into a corpus. Then the corpus will be processed in NLP: words will be lowercased, punctuation marks will be removed as well as stopwords and finally words will be stemmed.

Tokenization will then take place, a bag of words being created. The bag of words takes the form of a Document Term Matrix: the 668 rows correspond to the 668 training reviews; there is a column for each token. At the junction of each row and each column, there is a frequency number representing the occurrence of the corresponding token in the corresponding review.

Applying a sparsity threshold of .995 will only leave tokens that appear in at least 0.5 % of reviews.
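As a minimal base-R illustration of these steps (the project itself uses the tm package, and stemming is omitted here), a tiny bag of words can be built as follows; the texts and the stopword list are invented for the example:

```r
texts <- c("Great phone, works well!", "Don't buy it. Terrible phone.")
stops <- c("it")  # illustrative stopword list, not stopwords("english")
toks <- lapply(texts, function(x) {
  x <- tolower(x)                                        # lowercasing
  x <- gsub("[[:punct:]]", " ", x)                       # punctuation handling
  w <- strsplit(trimws(gsub(" +", " ", x)), " ")[[1]]    # tokenization into words
  w[!w %in% stops]                                       # stopword removal
})
vocab <- sort(unique(unlist(toks)))
# Document Term Matrix: one row per review, one column per token, cells = frequencies
dtm <- t(sapply(toks, function(w) table(factor(w, levels = vocab))))
```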

As a pre-attentive insight, a wordcloud will show the most frequent tokens. The wordcloud is interactive: just hover over a token and you get the frequency of occurrence.

There are topic-related tokens such as “phone”, tokens conveying subjective information such as “great”, etc. Before analyzing token categories, let’s check up the technical adequacy of results from the NLP process.


1.3.2 B. Checking up NLP Output

Some tokens were not expected, such as “dont” or “ive”, since they seem to originate in short forms and were expected to have been eliminated as stopwords.

Let’s start investigating with “dont”. The frequency of occurrence is at least 10 since that is a prerequisite to enter the wordcloud. But there can be more instances.


# REVIEWS CONTAINING “dont”
Bag of Words from Training Reviews 20
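The count above can be reproduced in spirit with a simple pattern match on the processed reviews; the toy texts below are taken from examples shown in this section, while the real count of 20 comes from the full training set:

```r
# toy reviews; removing punctuation fuses "don't" into "dont"
texts <- c("dont buy it.",
           "Don't trust their website and don't expect any helpful support.")
processed <- gsub("[[:punct:]]", "", tolower(texts))
sum(grepl("\\bdont\\b", processed))  # number of reviews containing "dont"
```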


Perusing training set reviews containing “dont” has led to distinguishing two scenarios. The first one is an exception, but it can be generalized to other tokens. Here it is.


REVIEW GENERATING “dont” IN BAG OF WORDS
Set of Training Reviews dont buy it.


“dont” contains a spelling error or is, in a more inclusive wording, “alternative” grammar: it has been used instead of “don’t”. Actually, there is only one such case in the bag of words. But it could happen more often and also with other short forms such as “couldn’t”, “isn’t”, … becoming “couldnt”, “isnt”, …

We are going to treat these misspelled short forms as if they were standardly written: we will complement the stopwords with variants such as “dont” and “couldnt”. Consequently, when stopwords are removed, the misspelled short forms will be eradicated along with the standardly written ones, at least for the misspellings we can think of. Complementing stopwords with misspelled short forms will be done in the next section, “C. Fine Tuning NLP”.
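That complementing step can be sketched as follows; the three short forms are illustrative stand-ins for the full list built in section “C. Fine Tuning NLP”:

```r
std_stops <- c("don't", "couldn't", "isn't")   # stand-ins for stopwords("english")
misspelled <- gsub("'", "", std_stops)          # "dont" "couldnt" "isnt"
extended_stops <- union(std_stops, misspelled)  # stopwords plus misspelled variants
```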

Now, let’s have a look at the most common scenario that generates “dont”. Let’s just show the one review with two occurrences.


REVIEW GENERATING “dont” IN BAG OF WORDS
Set of Training Reviews Don’t trust their website and don’t expect any helpful support.


This is the general scenario: “don’t” has been written in a standard way; but all punctuation marks have been removed and consequently it has become “dont”; it is no longer identical to the stopword “don’t” and, very logically, it has not been removed.

This scenario happened 19 times for “don’t” while processing the training set. Moreover, it was the same scenario for the other short forms.

In order to prevent that scenario from happening, there are simple solutions, e.g.:

  • discarding stopwords, and consequently short forms, before removing punctuation;
  • or, removing punctuation marks with the exception of apostrophes, discarding stopwords, and consequently short forms, and only then removing the remaining apostrophes (apostrophes present at other places than in short forms).

An appropriate solution will be applied in the next section “C. Fine Tuning NLP”.

While examining other tokens coming from training reviews, other oddities have been discovered. There are several unigrams that seem to originate from two words, e.g. “brokeni” at row 24 of all tokens sorted alphabetically (i.e. all tokens before applying the sparsity process described above).


24 breakag brilliant broke brokeni brows


Which training review does “brokeni” come from?


REVIEW PRODUCING “brokeni”
Set of Training Reviews I got the car charger and not even after a week the charger was broken…I went to plug it in and it started smoking.


What happened? Well, “broken…I” was first lowercased to “broken…i”, then punctuation was removed by the function removePunctuation(), which does not insert any white space character, and “broken…i” has become “brokeni”.
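The difference between removing punctuation and replacing it with spaces can be checked directly in base R; the first gsub() call mimics what removePunctuation() essentially does here:

```r
x <- tolower("the charger was broken...I went to plug it in")
gsub("[[:punct:]]", "", x)   # words fused: "... brokeni went ..."
gsub("[[:punct:]]", " ", x)  # word boundary preserved: "... broken   i went ..."
```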

This has to be corrected of course for “brokeni” but also for similar cases. In the next section “C. Fine Tuning NLP”, a general solution will be applied.


1.3.3 C. Fine Tuning NLP

Instead of using the function removePunctuation() from the package tm, specific “for loops” will be developed, preprocessing reviews according to the needs stated above and in a stepwise way:

  • punctuation marks other than apostrophes will be replaced with white space characters instead of just being removed;
  • short forms will be removed;
  • remaining apostrophes will be replaced with white space characters;
  • other stopwords will be removed (it is done in step 4 and not in step 2 in order to do it when absolutely all punctuation marks have been removed: please see example with “brokeni” where two words and one punctuation mark are stuck together…).
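The four steps above can be sketched as a single base-R function; this is a simplified stand-in for the project's “for loops”, and the stopword vectors passed in the example are illustrative:

```r
preprocess <- function(x, short_forms, other_stops) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)        # step 1: punctuation except apostrophes -> spaces
  for (sf in short_forms)              # step 2: remove short forms (apostrophes intact)
    x <- gsub(paste0("\\b", sf, "\\b"), " ", x)
  x <- gsub("'", " ", x)               # step 3: remaining apostrophes -> spaces
  for (st in other_stops)              # step 4: remove other stopwords
    x <- gsub(paste0("\\b", st, "\\b"), " ", x)
  trimws(gsub(" +", " ", x))           # squeeze multiple spaces
}
preprocess("Don't trust their website, don't expect any helpful support.",
           short_forms = c("don't"), other_stops = c("their", "any"))
```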

Among stopwords, short forms (contractions) need specific treatment. Additional breakdowns might also prove necessary. Starting from the stopword list delivered by the function stopwords(“english”) from the package tm, four CSV files will be produced.

These are the four files:

  • short_forms_pos.csv, with all positive short forms from stopwords(“english”) such as “she’s”, a few additional ones and numerous misspelled variants such as “she s” or “shes”;
  • short_forms_neg.csv, in the same approach, for short forms such as “isn’t”, “daren’t” but also “isn t”, “isnt”, etc.;
  • negation.csv, with seven negational unigrams such as “not” or “no”;
  • stopwords_remaining.csv, which is self-explanatory.

The four files have been uploaded to the GitHub repository https://github.com/Dev-P-L/Sentiment-Analysis. They are going to be downloaded now and integrated into NLP preprocessing.

Let’s rebuild the corpus, the bag of words and the interactive wordcloud (just hover over tokens to get the frequency of occurrence).

In the wordcloud, there is no more token originating from short forms.

Let’s have a broader look, building up a presentation table and checking whether all abovementioned oddities have disappeared. Let’s check up in the bag of words whether “dont” has indeed disappeared.


51 dit dock done doubl download


Yes, indeed, “dont” has disappeared. Let’s check up in the same way for “ive”!


96 issu item jabra jawbon jerk


“ive” has also disappeared. Now “brokeni”.


21 breakag brilliant broke broken brows


“brokeni” has vanished as well, just like many other oddities. Some spelling errors, such as “disapoint” or “dissapoint”, are left uncorrected because they show no repetitive structure and their occurrence seems marginal.

Let’s have a first try at predicting sentiment on the basis of sentSparse_av0, which originates from our training set.


1.3.4 D. Measuring NLP Impact on Prediction Accuracy

NLP impact will be computed as the gain in accuracy provided by a standard machine learning model in comparison with the baseline model.

The chosen machine learning model will be CART: it runs rather quickly and delivers clear decision trees. Running function rpart() on the training set delivers the accuracy level mentioned hereunder.


ACCURACY ON THE TRAINING SET
Model: CART 0.768
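A sketch of such a fit with rpart() is shown below; the synthetic two-token data frame stands in for the full bag of words, so the variable names and the accuracy level are illustrative only:

```r
library(rpart)  # ships with standard R installations
set.seed(7)
# synthetic bag-of-words stand-in: one strongly polar token, one noise token
n <- 200
df <- data.frame(great = rbinom(n, 1, 0.5), phone = rbinom(n, 1, 0.5))
df$polarity <- factor(ifelse(df$great == 1, "Pos", "Neg"))
fit <- rpart(polarity ~ ., data = df, method = "class")
acc <- mean(predict(fit, df, type = "class") == df$polarity)  # training accuracy
```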


Now let’s train the rpart method with the train() function from the package caret.

By default, the train() function would train across 3 values of cp (the complexity parameter) and 25 bootstrapped resamples for each tuned value of cp. As far as the number of tuned values is concerned, let’s upgrade it to 15 to increase the odds of improving accuracy, especially as rpart runs rather quickly.

The default resampling method is bootstrapping: samples are built with replacement, some reviews being picked twice or more and others not being selected at all. This method seems especially appropriate here because the size of each resample equals the size of the training set, which is already limited, i.e. 668. Working with e.g. K-fold cross-validation would imply further splitting the training set.
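The effect just described can be checked directly: a bootstrapped resample of the 668 training reviews has the same size as the training set, picks some reviews several times, and leaves out roughly a third (the seed below is arbitrary):

```r
set.seed(123)
n <- 668
boot_idx <- sample(n, n, replace = TRUE)   # resample with replacement, same size
n_unique <- length(unique(boot_idx))        # reviews actually drawn
n_oob <- n - n_unique                       # "out-of-bag" reviews, roughly a third
c(size = length(boot_idx), unique = n_unique, out_of_bag = n_oob)
```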

Will accuracy improve?


ACCURACY ON THE TRAINING SET
Model: CART + cp Tuning 0.7814


Accuracy increases somewhat, i.e. from 76.8 % to 78.1 %. For the record, let’s have a look at a graph showing how accuracy evolves across the 15 cp values chosen by the train() function.



On the graph above, maximum accuracy is a bit lower than the level previously indicated. Why is it different? Because, on the graph, it is, for each cp value, the average accuracy on the 25 bootstrapped resamples, while accuracy previously given related to the whole training set.

The optimal value of cp is near zero. Is it really zero?


OPTIMAL cp VALUE
Model: CART + cp Tuning 0.0000


Yes, it is zero. This means that the train() function has kept the decision tree as complex as possible by assigning a zero value to the complexity parameter.

On the whole training set, the rpart model without tuning delivers approximately 77 % accuracy and the rpart model with tuning 78 %. Both levels are substantially higher than the accuracy provided by the baseline model.

The baseline model would predict a positive evaluation for all reviews (or alternatively a negative evaluation for all reviews) since prevalence is 50 %. Prevalence should show in the accuracy level delivered by the baseline model on the training set. Let’s check it up.
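That check can be sketched as follows, with toy balanced labels standing in for the training set's evaluations:

```r
# baseline model: predict the same class for every review
polarity <- rep(c("Pos", "Neg"), each = 334)     # balanced, as in the training set
baseline_acc <- max(table(polarity)) / length(polarity)
baseline_acc  # equals the 50 % prevalence
```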